Abstract

Continuous vector representations of words and objects appear to carry
surprisingly rich semantic content. In this paper, we advance both the
conceptual and theoretical understanding of word embeddings in three ways.
First, we ground embeddings in semantic spaces studied in the
cognitive-psychometric literature and introduce new evaluation tasks. Second, in
contrast to prior work, we take metric recovery as the key object of study,
unify existing algorithms as consistent metric recovery methods based on
co-occurrence counts from simple Markov random walks, and propose a
new recovery algorithm. Third, we generalize metric recovery to graphs
and manifolds, relating co-occurrence counts on random walks in graphs
and random processes on manifolds to the underlying metric to be recovered,
thereby reconciling manifold estimation and embedding algorithms.
We compare embedding algorithms across a range of tasks, from nonlinear
dimensionality reduction to three semantic language tasks, including
analogies, sequence completion, and classification.


1 Introduction 

Continuous vector representations of words, objects, and signals have been
widely adopted across areas, from natural language processing and computer
vision to speech recognition. Methods for estimating these representations such
as neural word embeddings [3, 14, 12] are typically simple and scalable enough
to be run on large corpora, yet result in word vectors that appear to capture
syntactically and semantically meaningful properties. Indeed, analogy tasks have
become de facto benchmarks to assess the semantic richness of word embeddings
[8, 12]. However, theoretical understanding of why they work has lagged behind
otherwise intriguing empirical successes.

Figure 1: Sternberg's model for inductive reasoning with embeddings (A, B,
C are given, I is the ideal point, and D are the choices; the correct answer is
shaded green).

Several recent contributions have aimed at bringing a better understanding 
of word embeddings, their properties, and associated algorithms [8, 5, 9, 1]. 
For example, [9] showed that the global minimum of the skip-gram method 
with negative sampling [12] implicitly factorizes a shifted version of the
pointwise mutual information (PMI) matrix of word-context pairs. Arora et al. [1]
explored links between random walks and word embeddings, relating them to 
contextual (probability ratio) analogies, under specific (isotropic) assumptions 
about word vectors. 

In this paper, we extend the conceptual and theoretical understanding of
word embeddings in three ways. First, we ground word embeddings in semantic
spaces studied in the cognitive-psychometric literature and consider three
inductive reasoning tasks for evaluating the semantic content in word vectors,
including analogies (previously studied) but also two new tasks, sequence
completion and classification. We demonstrate that existing and proposed
algorithms perform well across these tasks. Second, in contrast to [1], we take
metric recovery as the key object of study and unify existing algorithms as
consistent metric recovery methods based on log(co-occurrence) counts arising
from simple Markov random walks. Motivated by metric recovery, we also introduce
and demonstrate a direct regression method for estimating word embeddings.
Third, we generalize metric recovery to graphs and manifolds, directly relating
co-occurrence counts for random walks on graphs and random processes on
manifolds to the underlying metric to be recovered.


2 Word vectors and semantic spaces 

Semantic spaces, i.e., vector spaces where semantically related words are close 
to each other, have long been an object of study in the psychometrics and 
cognitive science communities [19, 21]. Rumelhart and Abrahamson proposed
that vector word representations derived from semantic similarity, along with
vector addition, could predict response choices in analogy questions [19]. This
hypothesis was verified in three ways: by solving analogies using embeddings
derived from survey data; by observing that human mistake rates followed an
exponential decay in embedded distance from the true solution; and by showing
that study subjects could answer analogies consistently with an embedding for
nonexistent animals after training [19].

Sternberg further proposed that general inductive reasoning was based upon
metric embeddings and tested two additional language tasks, series completion
and classification (see Figure 1) [21]. In the completion task, the goal is to
predict which word should come next in a series of words (e.g. given penny,
nickel, dime, the answer should be quarter). In the classification task, the goal
is to choose the word that best fits a set of given words. For example, given
zebra, giraffe and goat, and candidate choices dog, mouse, cat and deer, the
answer would be deer since it fits the first three terms best. Sternberg proposed
that, given word embeddings, a subject solves the series completion problem by
finding the next point in the line defined by the given words, and solves the
classification task by finding the candidate word closest to the centroid of the
given words. A reproduction of Sternberg's original graphical depiction of the
three induction tasks is given in Figure 1. As with analogies, we find that word
embedding methods perform surprisingly well at these additional tasks. For
example, in the series completion task, given "body, arm, hand" we find the
completion to be "fingers".
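The three induction rules described above can be sketched in a few lines. This is
a minimal illustration, not the evaluation code used in the paper: the function
names are hypothetical, L2 distance is used to pick the nearest candidate, and
the series rule (extrapolating one average step past the last given word) is one
possible reading of "the next point in the line defined by the given words".

```python
import numpy as np

def nearest(target, candidates):
    """Index of the candidate vector closest to `target` in L2 distance."""
    diffs = np.asarray(candidates, dtype=float) - np.asarray(target, dtype=float)
    return int(np.argmin(np.linalg.norm(diffs, axis=1)))

def solve_analogy(a, b, c, candidates):
    """A:B :: C:? with ideal point I = C + (B - A)."""
    a, b, c = (np.asarray(v, dtype=float) for v in (a, b, c))
    return nearest(c + (b - a), candidates)

def solve_series(given, candidates):
    """Assumed rule: extrapolate one average step beyond the last given word."""
    given = np.asarray(given, dtype=float)
    step = (given[-1] - given[0]) / (len(given) - 1)
    return nearest(given[-1] + step, candidates)

def solve_classification(given, candidates):
    """Pick the candidate closest to the centroid of the given words."""
    centroid = np.mean(np.asarray(given, dtype=float), axis=0)
    return nearest(centroid, candidates)
```

Each function returns the index of the chosen candidate, so the same toy
embedding matrix can be reused across all three tasks.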

Many of the embedding algorithms are motivated by the distributional
assumption [6]: words appearing in similar contexts across a large corpus should
have similar vector representations. Going beyond this hypothesis, we follow the 
psychometric literature more closely and take metric recovery as the key object 
of study, unifying and extending embedding algorithms from this perspective. 


3 Recovering semantic distances with word embedding


We begin with a simple model proposed in the literature [2] where word
co-occurrences over adjacent words represent semantic similarity, and generalize
the model in later sections. Our corpus consists of m total words across s
sentences over an n-word vocabulary, where each word is given a coordinate in a
latent word vector space $\{x_1, \dots, x_n\} \subset \mathbb{R}^d$. For each
sentence s we consider a Markov random walk, $X_1, \dots, X_{m_s}$, with the
following transition function

\[
P(X_t = x_j \mid X_{t-1} = x_i) =
\frac{\exp(-\|x_i - x_j\|_2^2/\sigma^2)}{\sum_{k=1}^{n} \exp(-\|x_i - x_k\|_2^2/\sigma^2)}.
\tag{1}
\]

3.1 Log co-occurrence as a metric

Suppose we observe the Gaussian random walk (Eq. 1) over a corpus with
m total words and define $C_{ij}$ as the number of times for which $X_t = x_j$
and $X_{t-1} = x_i$.¹ By the Markov chain law of large numbers, as $m \to \infty$,

\[
-\log\Big(C_{ij} \Big/ \sum_{k=1}^{n} C_{ik}\Big) \to \|x_i - x_j\|_2^2/\sigma^2 + \log(Z_i)
\]

where $Z_i = \sum_{k=1}^{n} \exp(-\|x_i - x_k\|_2^2/\sigma^2)$ (see Supplementary Lemma S1.1).
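The law-of-large-numbers statement can be checked empirically on a toy
vocabulary. The sketch below simulates the Gaussian random walk of Eq. 1, counts
transitions, and compares the negative log of the row-normalized counts with the
scaled squared distance plus log(Z_i). The vocabulary size, dimension, walk
length, and uniform placement of the latent points are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma, m = 6, 2, 1.0, 200_000   # toy sizes (assumed for illustration)

x = rng.uniform(0.0, 1.0, size=(n, d))                    # latent word coordinates
sq_dist = ((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=-1)
kernel = np.exp(-sq_dist / sigma**2)
Z = kernel.sum(axis=1)                                    # normalizers Z_i
P = kernel / Z[:, None]                                   # transition matrix of Eq. 1
cdf = np.cumsum(P, axis=1)

# Simulate one long walk by inverse-CDF sampling of each transition.
u = rng.random(m)
states = np.zeros(m + 1, dtype=int)
for t in range(m):
    nxt = np.searchsorted(cdf[states[t]], u[t])
    states[t + 1] = min(nxt, n - 1)                       # guard against rounding in cdf

# C[i, j] counts transitions from word i to word j.
C = np.zeros((n, n))
np.add.at(C, (states[:-1], states[1:]), 1.0)

lhs = -np.log(C / C.sum(axis=1, keepdims=True))
rhs = sq_dist / sigma**2 + np.log(Z)[:, None]
max_gap = float(np.abs(lhs - rhs).max())
print(f"largest deviation between the two sides: {max_gap:.3f}")
```

The deviation shrinks as the walk length m grows, which is the content of the
limit stated above.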

More generally, consider the following limit that relates log co-occurrence to
word embeddings.

Lemma 1. Let $C_{ij}$ be a co-occurrence matrix over a corpus of size m and x be
coordinates of words in the latent semantic space, then there exists a sequence
$a_i^m$ and $b_j^m$ such that as $m \to \infty$,

\[
-\log(C_{ij}) - a_i^m \to \|x_i - x_j\|_2^2 + b_j^m.
\]

The Gaussian random walk above is a special case of this limit; we will show
that random walks on graphs and some topic models fulfill this metric recovery
limit.
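To make the special case explicit, the law-of-large-numbers limit for the
Gaussian random walk can be rearranged into the form of Lemma 1. The following
is a sketch under the assumption that coordinates are rescaled by $\sigma$ and
that the row totals $\sum_k C_{ik}$ are absorbed into $a_i^m$; the
identification of $a_i^m$ and $b_j^m$ shown here is one possible choice, not
necessarily the one used in the supplement.

```latex
\begin{align*}
-\log\!\Big(\frac{C_{ij}}{\sum_k C_{ik}}\Big)
  &\to \frac{\|x_i - x_j\|_2^2}{\sigma^2} + \log(Z_i) \\
\Longrightarrow\quad
-\log(C_{ij}) - \underbrace{\Big(\log(Z_i) - \log \sum_k C_{ik}\Big)}_{a_i^m}
  &\to \Big\|\frac{x_i}{\sigma} - \frac{x_j}{\sigma}\Big\|_2^2 + \underbrace{0}_{b_j^m}
\end{align*}
```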

3.2 Consistency of word embeddings 

Applying this to three word embedding algorithms, we show the conditions of
Lemma 1 are sufficient to ensure that the true embedding x is a global minimum.

GloVe: The Global Vectors (GloVe) [17] method for word embedding optimizes
the objective function

\[
\min_{\hat{x}, \hat{c}, a, b} \sum_{i,j} f(C_{ij})
\big(2\langle \hat{x}_i, \hat{c}_j \rangle + a_i + b_j - \log(C_{ij})\big)^2
\]

with $f(C_{ij}) = \min(C_{ij}, 10)^{3/4}$. If we rewrite the bias terms as
$a_i = \hat{a}_i - \|\hat{x}_i\|_2^2$ and $b_j = \hat{b}_j - \|\hat{c}_j\|_2^2$,
we obtain the equivalent representation:

\[
\min_{\hat{x}, \hat{c}, \hat{a}, \hat{b}} \sum_{i,j} f(C_{ij})
\big(-\log(C_{ij}) - \|\hat{x}_i - \hat{c}_j\|_2^2 + \hat{a}_i + \hat{b}_j\big)^2.
\]

When combined with Lemma 1, we recognize this as a weighted multidimensional
scaling objective with weights $f(C_{ij})$. Splitting the word vector
$\hat{x}_i$ and context vector $\hat{c}_i$ is helpful in practice to optimize
this objective, but not necessary under the assumptions of Lemma 1 since the true
embedding $\hat{x}_i = \hat{c}_i = x_i/\sigma$ and $\hat{a}_i, \hat{b}_i = 0$ is
a global minimum whenever $\dim(\hat{x}) = d$. (See Thm S1.4 for detail.)
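The equivalence of the inner-product and distance forms of the GloVe objective
can be verified numerically. The sketch below draws random vectors, biases, and
toy counts (all illustrative assumptions), applies the bias substitution, and
evaluates both objectives. The exponent 3/4 in the weight function follows the
usual GloVe convention and is assumed here.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 3
x = rng.normal(size=(n, d))                          # word vectors
c = rng.normal(size=(n, d))                          # context vectors
a_hat = rng.normal(size=n)                           # reparametrized word biases
b_hat = rng.normal(size=n)                           # reparametrized context biases
C = rng.integers(1, 50, size=(n, n)).astype(float)   # toy co-occurrence counts

f = np.minimum(C, 10.0) ** 0.75                      # weight function (exponent assumed)

# Original parametrization: a_i = a_hat_i - ||x_i||^2, b_j = b_hat_j - ||c_j||^2.
a = a_hat - (x ** 2).sum(axis=1)
b = b_hat - (c ** 2).sum(axis=1)
inner_form = (f * (2 * x @ c.T + a[:, None] + b[None, :] - np.log(C)) ** 2).sum()

# Distance form obtained after the substitution.
sq_dist = ((x[:, None, :] - c[None, :, :]) ** 2).sum(axis=-1)
metric_form = (f * (-np.log(C) - sq_dist + a_hat[:, None] + b_hat[None, :]) ** 2).sum()

print(inner_form, metric_form)   # the two values agree up to floating-point error
```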

¹ In practice, word embedding methods use a symmetrized window rather than counting
transitions. This does not change any of the asymptotic analysis in the paper
(Supplementary section S2).

word2vec: The embedding algorithm word2vec approximates a softmax objective:

\[
\min_{\hat{x}, \hat{c}} \sum_{i,j} C_{ij}
\log\left(\frac{\exp(\langle \hat{x}_i, \hat{c}_j \rangle)}
{\sum_{k=1}^{n} \exp(\langle \hat{x}_i, \hat{c}_k \rangle)}\right).
\]

If $\dim(\hat{x}) = d + 1$ we can set one of the dimensions of $\hat{x} = 1$ as a
bias term, allowing us to rewrite the objective with a slack parameter $b_j$
analogously to GloVe. After reparametrization we obtain that for
$\hat{b}_j = b_j - \|\hat{c}_j\|_2^2$,

\[
\min_{\hat{x}, \hat{c}, \hat{b}} \sum_{i,j} C_{ij}
\log\left(\frac{\exp(-\|\hat{x}_i - \hat{c}_j\|_2^2 + \hat{b}_j)}
{\sum_{k=1}^{n} \exp(-\|\hat{x}_i - \hat{c}_k\|_2^2 + \hat{b}_k)}\right).
\]

Since
$C_{ij}/\sum_{k=1}^{n} C_{ik} \to \exp(-\|x_i - x_j\|_2^2/\sigma^2) / \sum_{k=1}^{n} \exp(-\|x_i - x_k\|_2^2/\sigma^2)$,
this is the stochastic neighborhood embedding objective weighted by
$\sum_{k=1}^{n} C_{ik}$. Once again, the true embedding
$\hat{x} = \hat{c} = x/\sigma$ is a global minimum (Theorem S1.5). The negative
sampling approximation used in practice behaves much like the SVD approach [9]
and thus, applying the same stationary point analysis as [9], the true embedding
is a global minimum under the additional assumption that
$\|x_i\|_2^2/\sigma^2 = \log\big(\sum_j C_{ij} / \sqrt{\sum_{ij} C_{ij}}\big)$.

SVD: The SVD approach [9] takes the log pointwise mutual information matrix:

\[
M_{ij} = \log(C_{ij}) - \log\Big(\sum_k C_{ik}\Big) - \log\Big(\sum_k C_{kj}\Big)
+ \log\Big(\sum_{ij} C_{ij}\Big)
\]

and applies the SVD to the shifted and truncated matrix $(M_{ij} + \tau)_+$. This
shift and truncation is done for computational reasons and to prevent $M_{ij}$
from diverging. In the limit where $m \to \infty$ and no truncation is performed
there exists a shift $\tau$ as a function of m such that the algorithm recovers
the underlying embedding assuming
$\|x_i\|_2^2/\sigma^2 = \log\big(\sum_j C_{ij} / \sqrt{\sum_{ij} C_{ij}}\big)$
(Lemma S1.3).

This assumption can be relaxed via a small modification to the algorithm:
assume without loss of generality that the latent word vectors are mean-centered.
Then we create the centered inner product matrix using the centering matrix
$V = I - \mathbf{1}\mathbf{1}^T/n$:

\[
\hat{M} = V M V^T / 2.
\]

This is exactly classical multidimensional scaling and
$\hat{M}_{ij} \to \langle x_i, x_j \rangle/\sigma^2$ since the centering removes
the offsets $a_i$, $b_j$ and the norms, making the SVD of $\hat{M}$ recover
$x_i$ and $x_j$ (Theorem S1.2).
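The centering argument can be illustrated on an idealized matrix. The sketch
below builds the limiting form of log(C_ij), namely a negative scaled squared
distance plus arbitrary row and column offsets, then double-centers it and
recovers the pairwise distances from the SVD. The sizes, the value of sigma, and
the random offsets are assumptions chosen for illustration; real counts would
add sampling noise.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 8, 2, 1.5
x = rng.normal(size=(n, d))
x -= x.mean(axis=0)                                   # mean-centered latent vectors

sq_dist = ((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=-1)
a = rng.normal(size=n)                                # arbitrary row offsets
b = rng.normal(size=n)                                # arbitrary column offsets
M = -sq_dist / sigma**2 + a[:, None] + b[None, :]     # idealized limit of log(C_ij)

V = np.eye(n) - np.ones((n, n)) / n                   # centering matrix
M_hat = V @ M @ V.T / 2                               # doubly centered matrix

# M_hat is symmetric positive semidefinite, so its SVD is an eigendecomposition.
U, s, _ = np.linalg.svd(M_hat)
x_hat = U[:, :d] * np.sqrt(s[:d])                     # embedding, up to rotation

gram_error = float(np.abs(M_hat - (x @ x.T) / sigma**2).max())
recovered = ((x_hat[:, None, :] - x_hat[None, :, :]) ** 2).sum(axis=-1)
dist_error = float(np.abs(recovered - sq_dist / sigma**2).max())
print(gram_error, dist_error)                         # both are at rounding level
```

The offsets vanish because the centering matrix annihilates any term that is
constant along rows or along columns.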


3.3 Metric regression from log co-occurrences

We have demonstrated that by reparametrizing and taking on additional
assumptions, existing word embedding algorithms could be cast as metric recovery
under Lemma 1. However, it is not known if metric recovery would be effective
in practice; for this we propose a new model which directly models Lemma 1
and acts as a litmus test for our metric recovery paradigm.

Lemma 1 describes a log-linear relationship between distance and co-occurrences.
The canonical way to fit such a relationship would be to use a generalized linear
model, where the co-occurrences $C_{ij}$ follow a negative binomial distribution:

\[
C_{ij} \sim \mathrm{NegBin}\big(\theta,\;
\theta(\theta + \exp(-\|x_i - x_j\|_2^2/2 + a_i + b_j))^{-1}\big).
\]

Under this overdispersed log-linear model,
$E[C_{ij}] = \exp(-\|x_i - x_j\|_2^2/2 + a_i + b_j)$ and
$\mathrm{Var}(C_{ij}) = E[C_{ij}]^2/\theta + E[C_{ij}]$. Here, the parameter
$\theta$ controls the contribution of large $C_{ij}$ and acts similarly to
GloVe's $f(C_{ij})$ weight function, which we cover in detail below. Fitting this
model is straightforward, as we can define the log-likelihood in terms of the
expected rate $\lambda_{ij} = \exp(-\|x_i - x_j\|_2^2/2 + a_i + b_j)$ as

\[
LL(x, a, b, \theta) = \sum_{i,j} \theta\log(\theta) - \theta\log(\lambda_{ij} + \theta)
+ C_{ij}\log\Big(1 - \frac{\theta}{\lambda_{ij} + \theta}\Big)
+ \log\Big(\frac{\Gamma(C_{ij} + \theta)}{\Gamma(\theta)\,\Gamma(C_{ij} + 1)}\Big)
\]

and perform gradient descent over the parameters, giving a simple update
formula in terms of the error $\delta_{ij}$ as

\[
\delta_{ij} = \frac{(C_{ij} - \lambda_{ij})\,\theta}{\lambda_{ij} + \theta}, \qquad
dx_i = \sum_j (x_j - x_i)(\delta_{ij} + \delta_{ji}), \qquad
da_i = \sum_j \delta_{ij}, \qquad
db_j = \sum_i \delta_{ij}.
\tag{2}
\]

Optimizing this objective using stochastic gradient descent will randomly select
word pairs i, j and attract or repulse the vectors x and c in order to achieve the
relationship in Lemma 1. Our implementation uses the GloVe codebase (see section
S5.1 for details).

Relationship to GloVe: The overdispersion parameter $\theta$ sheds light on the
role of GloVe's weight function $f(C_{ij})$. Taking the Taylor expansion of the
log-likelihood at $\log(\lambda_{ij}) \approx \log(C_{ij})$ we find that for a
constant $k_{ij}$,

\[
LL(x, a, b, \theta) = \sum_{i,j} k_{ij}
- \frac{C_{ij}\,\theta}{2(C_{ij} + \theta)}\big(\log(\lambda_{ij}) - \log(C_{ij})\big)^2
+ o\big((\log(\lambda_{ij}) - \log(C_{ij}))^2\big).
\]

Note the similarity of the second order term with the GloVe objective. Both
weight functions, $\frac{C_{ij}\theta}{2(C_{ij} + \theta)}$ and
$f(C_{ij}) = \min(C_{ij}, x_{\max})^{3/4}$, smoothly asymptote, downweighting
large co-occurrences. However, the empirical performance suggests that in
practice, optimizing the distances directly and using the negative binomial loss
consistently improves performance.
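The negative binomial log-likelihood and the gradient in Eq. 2 can be written
down directly and compared against finite differences. This is a minimal sketch
with a single set of vectors x and dense full-batch gradients; it is not the
GloVe-codebase implementation mentioned above, and the function names are
hypothetical.

```python
import numpy as np
from math import lgamma

_lgamma = np.vectorize(lgamma)   # elementwise log-gamma without extra dependencies

def rates(x, a, b):
    """Expected rates lambda_ij = exp(-||x_i - x_j||^2 / 2 + a_i + b_j)."""
    sq_dist = ((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dist / 2 + a[:, None] + b[None, :])

def log_likelihood(x, a, b, C, theta):
    """Negative binomial log-likelihood summed over all pairs (i, j)."""
    lam = rates(x, a, b)
    log_coef = _lgamma(C + theta) - lgamma(theta) - _lgamma(C + 1.0)
    return float((theta * np.log(theta) - theta * np.log(lam + theta)
                  + C * np.log(1.0 - theta / (lam + theta)) + log_coef).sum())

def gradients(x, a, b, C, theta):
    """Gradients of the log-likelihood with respect to x, a and b (Eq. 2)."""
    lam = rates(x, a, b)
    delta = (C - lam) * theta / (lam + theta)          # error term delta_ij
    w = delta + delta.T                                # delta_ij + delta_ji
    dx = w @ x - w.sum(axis=1, keepdims=True) * x      # sum_j (x_j - x_i) w_ij
    return dx, delta.sum(axis=1), delta.sum(axis=0)
```

Because these are gradients of the log-likelihood, a fitting loop would step in
their direction (ascent), or equivalently descend on the negative
log-likelihood.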


4 Metric recovery from Markov processes on graphs 
and manifolds 

Metric recovery from random walks is possible under substantially more general
conditions than the simple Markov process in Eq. 1. We take an extreme view
here and show that even a random walk over an unweighted directed graph
holds enough information for metric recovery provided that the graph itself is
suitably constructed in relation to the underlying metric.² To this end, we use
a limiting argument (large vocabulary limit) with increasing numbers of points
$X_n = \{x_1, \dots, x_n\}$, where $x_i$ are sampled i.i.d. from a density $p(x)$
over a compact Riemannian manifold. For our purposes, $p(x)$ should have a
bounded log-gradient and a strict lower bound $p_0$ over the manifold. Since the
points are assumed to lie on the manifold, we use the squared geodesic distance
$\rho(x_i, x_j)^2$ in place of $\|x_i - x_j\|_2^2$ used earlier. The random walks
we consider are over unweighted spatial graphs defined as follows.

Definition 2 (Spatial graph). Let $\sigma_n : X_n \to \mathbb{R}_{>0}$ be a local
scale function and $h : \mathbb{R}_{>0} \to [0, 1]$ a piecewise continuous
function with sub-Gaussian tails. A spatial graph $G_n$ corresponding to
$\sigma_n$ and h is a random graph with vertex set $X_n$ and a directed edge from
$x_i$ to $x_j$ with probability $p_{ij} = h(\rho(x_i, x_j)^2/\sigma_n(x_i)^2)$.

Simple examples of spatial graphs where the connectivity is not random
($p_{ij} = 0, 1$) include the $\varepsilon$-ball graph ($\sigma_n(x) = \varepsilon$)
and the k-nearest neighbor graph ($\sigma_n(x)$ = distance to the k-th neighbor).
As in the k-nn graph, $\sigma_n$ may depend on the set of points $X_n$.

Our goal is to show that, as $n \to \infty$, we can recover $\rho(x_i, x_j)$ from
co-occurrence counts generated from simple random walks over $G_n$. Log
co-occurrences and the geodesic will be connected in two steps: (1) we use known
results to show that a simple random walk over the spatial graph, properly
scaled, behaves similarly to a diffusion process; (2) the log-transition
probability of a diffusion process will be related to the geodesic metric on a
manifold.

(1) The limiting random walk on a graph: Just as the simple random
walk over the integers converges to a Brownian motion, we may expect that
under specific constraints the simple random walk $X_t^n$ over the graph $G_n$
will converge to some well-defined continuous process. We require that the scale
functions converge to a continuous function $\bar{\sigma}$
($\sigma_n(x) g_n^{-1} \to \bar{\sigma}(x)$ a.s.); the size of a single step
vanish ($g_n \to 0$) but contain at least a polynomial number of points within
$\sigma_n(x)$ ($g_n n^{1/(d+2)} \log(n)^{-1/(d+2)} \to \infty$). Under this
limit, our assumptions about the density $p(x)$, and an additional regularity
condition,³ we have the following.

Theorem 3 (Stroock-Varadhan on graphs [?, 22]). The simple random walk $X_t^n$
on $G_n$ converges in Skorokhod space $D([0, \infty), \overline{D})$ after a time
scaling $\hat{t} = t g_n^2$ to the Itô process $Y_{\hat{t}}$ valued in
$C([0, \infty), \overline{D})$ as $X^n_{\hat{t} g_n^{-2}} \to Y_{\hat{t}}$. The
process $Y_{\hat{t}}$ is defined over the normal coordinates of the manifold
$(D, g)$ with reflecting boundary conditions on D as

\[
dY_{\hat{t}} = \nabla \log(p(Y_{\hat{t}}))\,\bar{\sigma}(Y_{\hat{t}})^2\,d\hat{t}
+ \bar{\sigma}(Y_{\hat{t}})\,dW_{\hat{t}}.
\tag{3}
\]

² The weighted graph case follows identical arguments, replacing Theorem 3 with
[22, Theorem 3].

³ To ensure convergence of densities, we require in addition that for
$t = \Theta(g_n^{-2})$, the rescaled marginal distribution $n P(X_t \mid X_0)$ is
a.s. uniformly equicontinuous. For undirected spatial graphs, this is known to be
true [4], but for directed graphs this is an open conjecture highlighted in [7].

(2) Log transition probability as a metric: We may now use the stochastic
process $Y_{\hat{t}}$ to connect the log transition probability to the geodesic
distance using Varadhan's large deviation formula.

Theorem 4 (Varadhan [24, 15]). Let $Y_t$ be an Itô process defined over a complete
Riemannian manifold $(D, g)$ with geodesic distance $\rho(x_i, x_j)$, then

\[
\lim_{t \to 0} -t \log(P(Y_t = x_j \mid Y_0 = x_i)) = \rho(x_i, x_j)^2.
\]

This estimate holds more generally for any space admitting a diffusive stochastic
process [20]. Taken together, we finally obtain Varadhan's formula over
graphs:

Corollary 5 (Varadhan's formula on graphs). For any $\delta, \gamma, n_0$ there
exists some $\hat{t}$, $n > n_0$, and sequence $b_j^n$ such that the following
holds for the simple random walk $X_t^n$:

\[
P\Big(\sup_{x_i, x_j \in X_{n_0}} \Big| -\hat{t}\log\big(P(X^n_{\hat{t} g_n^{-2}} = x_j \mid X^n_0 = x_i)\big)
- \hat{t} b_j^n - \rho_{\bar{\sigma}(x)}(x_i, x_j)^2 \Big| > \delta\Big) < \gamma,
\]

where $\rho_{\bar{\sigma}(x)}$ is the geodesic defined as
$\rho_{\bar{\sigma}(x)}(x_i, x_j) = \min_{f \in C^1 : f(0) = x_i, f(1) = x_j} \int_0^1 \bar{\sigma}(f(t))\,dt$.

Proof. Sketch: For the Itô process, Varadhan's formula (Theorem 4) implies
that we can find some time $\hat{t}$ such that the log-marginal distribution of Y
is close to the geodesic. To convert this statement to our graph setting, we use
the convergence of stochastic processes (Theorem 3) with equicontinuity of
marginals to ensure that after $t = \hat{t} g_n^{-2}$ steps, the transition
probability over the graph converges to the marginal distribution of Y. Finally,
compactness of the domain implies that log-marginals converge, resulting in
Varadhan's formula for graphs (see Corollary S3.2 for details). □

Since the co-occurrence $C_{ij}$ has the limit
$\log(C_{ij}/\sum_k C_{ik}) \to \log P(X_{t+1} = x_j \mid X_t = x_i)$, this results
in an analog of Lemma 1 in the manifold setting. Our proof demonstrates that
regardless of the graph weights and manifold structure, in the large-sample
small-time limit, log co-occurrences faithfully capture the underlying metric
structure of the data. While there have been ad-hoc attempts to apply word
embeddings to graph random walks [18], this theorem demonstrates that embedding
the log co-occurrence is a principled method for graph metric recovery.
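The qualitative content of this result can be seen on a toy one-dimensional
example. The sketch below samples points on the unit segment, builds an
unweighted epsilon-ball graph, and compares the negative log of the exact t-step
transition probabilities with the squared distance from the start point. The
sample size, radius, and number of steps are illustrative assumptions, and the
check is a correlation rather than a verification of the constants in the
corollary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, eps, t = 200, 0.1, 10                       # toy settings (assumed)
x = np.sort(rng.uniform(0.0, 1.0, size=n))     # points on a 1-D manifold

dist = np.abs(x[:, None] - x[None, :])         # geodesic distance on the segment
A = ((dist <= eps) & (dist > 0)).astype(float) # unweighted epsilon-ball graph
P = A / A.sum(axis=1, keepdims=True)           # simple random walk on the graph
Pt = np.linalg.matrix_power(P, t)              # exact t-step transition probabilities

start = int(np.argmin(np.abs(x - 0.5)))        # start near the middle of the segment
neg_log_p = -np.log(Pt[start])
corr = float(np.corrcoef(neg_log_p, dist[start] ** 2)[0, 1])
print(f"correlation between -log P^t and squared distance: {corr:.3f}")
```

Using the exact matrix power removes sampling noise, so any remaining scatter
comes from the random graph itself.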

Generalizing the Markov sentence model: The spatial Markov random
walk defined above has two flaws. First, it cannot properly account for function
words such as "the", since whenever the Markov chain transitions from a topic
to a function word, it forgets the original topic. Second, since the unigram
frequency of a word is the stationary distribution, frequent words are
geometrically constrained to be close to all other words. Both of these
assumptions can be relaxed by assuming that a latent spatial Markov chain, which
we call the topic process $Y_t$, generates the observed sentence process $X_t$.
This idea of a latent topic model underlying word embeddings has been explored
[1]; our contributions are threefold: we contextualize this model as part of a
metric embedding framework, provide an intuitive proof that directly applies
Varadhan's formula, and relax some constraints on the distribution of Y by taking
the large vocabulary limit (see section S4.1).

              Google Analogies (cos)    Google Analogies (L2)         SAT
Method        Sem.    Synt.   Total     Sem.    Synt.   Total     L2      Cosine
Regression    78.4    70.8    73.7      75.5    70.9    72.6      39.2    37.8
GloVe         72.6    71.2    71.7      65.6    66.6    67.2      36.9    33.6
SVD           57.4    50.8    53.4      53.7    48.2    50.3      27.1    25.8
Word2vec      73.4    73.3    73.3      71.4    70.9    71.1      42.0    42.0

Table 1: Regression and Word2vec perform well on Google and SAT analogies.

The topic process $Y_t$ is defined over $\mathbb{R}^d$⁴ by local jumps according
to a smooth sub-Gaussian kernel h with $\int \|x\|_2^2 h(x)\,dx = \sigma_0^2$,
movement rate $\sigma^2$, and a log-differentiable topic distribution $w(x)$
which defines the stationary distribution of the current topic over the latent
semantic space:

\[
P(Y_{t+1} \mid Y_t) = h\big(\|Y_{t+1} - (Y_t + \nabla\log(w(Y_t))\sigma^2)\|_2^2/\sigma^2\big).
\tag{4}
\]

Given a topic $Y_t$, we assume the probability of observing a particular word
decays exponentially with the semantic distance between the current topic and
word scaled by $\alpha$, as well as a non-metric frequency $a_i$ which accounts
for the frequency of function words such as "the" and "and":

\[
P(X_t = x_i \mid Y_t = y) \propto a_i \exp(-\|x_i - y\|_2^2/\alpha^2).
\]

Under this general model, we obtain a heat kernel estimate analogous to Cor. 5,
with constraints on the new scale parameter $\alpha$ (Theorem S4.1),

\[
P(X_t = x_j \mid X_0 = x_i) \propto \frac{a_j}{\pi(x_i)}\, w(x_i)
\exp\Big(-\frac{\|x_j - x_i\|_2^2}{2(\alpha^2 + t\sigma^2\sigma_0^2)}\Big)
\big(1 + O(\sigma^2\alpha^2 t) + O(\alpha^2)\big) + O(t^{-1/2}).
\]

This allows word embedding algorithms to handle latent processes under the
same small neighborhood ($\sigma \to 0$), large window ($t \to \infty$) limit,
assuming that $\sigma_0^2$ is small relative to the Hessian of $w(x)$ (see
Theorem S4.1 for details).


5 Empirical validation 

We experimentally validate two aspects of our word embedding theory: the
semantic space hypothesis, and the manifold Varadhan's formula. Our goal is
not to find the absolute best method and evaluation metric for word embeddings,
which has been studied in detail [10]. Instead we will demonstrate that word
embeddings based on metric recovery are competitive with the existing
state-of-the-art in both manifold learning and semantic induction tasks.

⁴ We assume Euclidean, rather than arbitrary manifold, since the additivity of
vectors implied by analogical reasoning tasks requires Euclidean embeddings.

5.1 Semantic spaces in vector word representations

Corpus and training: We trained all methods on three different corpora: 2.4B
tokens from Wikipedia, 6.4B tokens used to train word2vec, and 5.8B tokens
combining Wikipedia with Gigaword5 emulating GloVe's corpus (see section S5.2
for details). We show performance for the GloVe corpus throughout but include
all corpora in section S7. Word embeddings were generated on the top 100K
words for each corpus using four methods: word2vec, GloVe, randomized SVD
(referred to as SVD), and metric regression (referred to as regression) (see
section S5.1).⁵

For fairness we fix the hyperparameter for metric regression at θ = 50,
developing and testing the code exclusively on the first 1GB subset of the wiki
dataset. Vectors used in this paper represent the first run of our method on
each full corpus. For open-vocabulary tasks, we restrict the set of answers to
the top 30K words, which improves performance while covering the majority of
the questions.

Solving analogies using survey data alone: We demonstrate that embedding
semantic similarity derived from survey data is sufficient for solving
analogies by replicating a study by Rumelhart and Abrahamson. In this
experiment, shown in Table 2, we take a free-association dataset [16] where words
are vertices on a graph and edge weights $w_{ij}$ represent the number of times
that word j was considered most similar to word i in a survey. We take the
largest connected component of 4845 words and 61570 weights and embed this
weighted graph using stochastic neighborhood embedding (SNE) and Isomap,
for which squared edge distances are defined as
$-\log(w_{ij}/\max_{kl}(w_{kl}))$. Solving the Google analogy questions [11]
covered by the 4845 words using these vectors shows that Isomap combined with
surveys can outperform the corpus-based metric regression vectors on semantic,
but not syntactic tasks; this is due to the fact that free-association surveys
capture semantic, but not syntactic similarity between words. These results
support both the semantic field hypothesis and the exponential decay of semantic
similarity with embedded distance.

Analogies: The results on the Google analogies shown in Table 1 demonstrate
that our proposed framework of metric regression and naive vector addition
(L2) is competitive with the baseline of word2vec with cosine distance. The
performance gap across methods is small and fluctuates across corpora, but
metric regression consistently outperforms GloVe on most tasks and outperforms
all methods on semantic analogies, while word2vec does better on syntactic
categories.

We also evaluate the methods on more difficult SAT type questions [23]
where a prototype pair A:B is given and we must choose amongst a set of
candidate pairs $[C_1 : D_1] \dots [C_5 : D_5]$. In this evaluation, cosine
similarity between vector differences is no longer the optimal choice and the L2
metric performs slightly better. In terms of methods, we find that word2vec is
best, followed by metric regression. The results on these two analogy datasets
show that directly embedding the log co-occurrence metric and taking L2 distances
between vectors is competitive with current approaches to analogical reasoning.
The consistent improvement of metric embedding over GloVe despite their
similarities in implementation (Section S5.1), parameters, and stationary point
(Section 3.3) suggests that the metric embedding approach to word embedding can
lead to algorithmic improvements.

⁵ We used randomized, rather than full SVD due to the difficulty of scaling SVD
to this problem size. For performance of full SVD factorizations see [10].
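The SAT-style selection rule can be sketched as follows: the offset B - A of the
prototype pair is compared with the offset D_k - C_k of every candidate pair,
using either cosine similarity or L2 distance between the difference vectors.
The function name and the toy vectors in the usage note are hypothetical.

```python
import numpy as np

def choose_pair(a, b, candidates, metric="l2"):
    """Index of the candidate pair (C_k, D_k) whose offset best matches B - A."""
    target = np.asarray(b, dtype=float) - np.asarray(a, dtype=float)
    offsets = np.array([np.asarray(dk, dtype=float) - np.asarray(ck, dtype=float)
                        for ck, dk in candidates])
    if metric == "cosine":
        norms = np.linalg.norm(offsets, axis=1) * np.linalg.norm(target)
        return int(np.argmax(offsets @ target / norms))
    # Default: smallest L2 distance between the two difference vectors.
    return int(np.argmin(np.linalg.norm(offsets - target, axis=1)))
```

The two metrics can disagree: cosine ignores the length of the offset, while L2
penalizes a candidate whose offset points the right way but is much longer than
the prototype's.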

Sequence and classification tasks: We propose two new difficult inductive
reasoning tasks based upon the semantic field hypothesis [21]. The sequence and
classification datasets, as described in Section 2, are tasks that require one to
pick either a sequence completion (hour, minute, ...) or find an element within
the same category out of five possible choices. The questions were generated
using WordNet semantic relations [13]. These datasets were constructed before
any embeddings to avoid biasing them towards any one method (see Section S5.3 for
further details). As predicted by the semantic field hypothesis, word embeddings
solve both tasks effectively, with metric embedding consistently performing well
on these multiple choice tasks (Table 3).

The metric recovery approach of metric regression methods and L2 distance
can consistently perform as well as the current state-of-the-art on the three
semantic tasks: Google semantic analogies, sequence, and classification.

5.2 Word embeddings can embed manifolds 

MNIST digits: We evaluate whether word embeddings can perform nonlinear
dimensionality reduction by embedding the MNIST digits dataset. Using a
four-thousand point subset, we generated a k-nearest neighbor graph (k = 20)
and generated 10 simple random walks of length 200 from each point, resulting
in 40,000 sentences each of length 200. We compared the four word embedding
methods against standard dimensionality reduction methods: PCA, Isomap,
SNE, and t-SNE. The quality of an embedding was measured using the percentage
of 5-nearest neighbors having the same cluster label. The four embeddings
shown in Fig. 2 demonstrate that metric regression is highly effective at this
task, outperforming metric SNE and beaten only by t-SNE (91%), which is a
visualization method designed for cluster separation. All word embedding
methods including SVD (68%) embed the MNIST digits well and outperform baselines
of PCA (48%) and Isomap (49%) (Supplementary Figure S1). This empirically
verifies the theoretical predictions in Corollary 5 that log co-occurrences
of a simple random walk converge to the squared geodesic.
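The data preparation and the evaluation metric described above can be sketched
as follows. This is not the experimental code: two synthetic Gaussian clusters
stand in for the MNIST subset, and the function names are hypothetical. The
walks play the role of sentences that would then be passed to any of the word
embedding methods.

```python
import numpy as np

def knn_graph(points, k):
    """Indices of the k nearest neighbors of every point (directed k-nn graph)."""
    sq = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(sq, np.inf)                  # exclude self-loops
    return np.argsort(sq, axis=1)[:, :k]

def random_walks(neighbors, walks_per_node, length, rng):
    """Simple random walks; each row is one 'sentence' of node ids."""
    n, k = neighbors.shape
    walks = np.empty((n * walks_per_node, length), dtype=int)
    walks[:, 0] = np.repeat(np.arange(n), walks_per_node)
    for t in range(1, length):
        choice = rng.integers(0, k, size=len(walks))
        walks[:, t] = neighbors[walks[:, t - 1], choice]
    return walks

def neighbor_purity(embedding, labels, k=5):
    """Fraction of k-nearest neighbors sharing the label of the query point."""
    nn = knn_graph(embedding, k)
    return float((labels[nn] == labels[:, None]).mean())

# Synthetic stand-in for the digit data: two well-separated clusters.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 100)
points = rng.normal(size=(200, 2)) + 10.0 * labels[:, None]

neighbors = knn_graph(points, k=20)
sentences = random_walks(neighbors, walks_per_node=10, length=200, rng=rng)
purity = neighbor_purity(points, labels, k=5)     # metric applied to the raw points
```

With 200 points and 10 walks per point this yields 2,000 sentences of length
200, mirroring the 4,000-point, 40,000-sentence setup at a smaller scale.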


6 Discussion 

Our work further justifies word embeddings by linking them to semantic spaces
from the psychometric literature. The key conceptual glue is metric recovery from
co-occurrences. The notion of semantic space, as well as our theoretical recovery
results, suggest the L2 distance can serve as a natural semantic metric. This is
reasonably supported by our empirical analysis, including the consistent
performance of the proposed direct regression method and the utility of L2
distance in selecting analogies.

Our framework highlights the strong interplay between methods for learning
word embeddings and manifold learning, suggesting several avenues for recovering
vector representations of phrases and sentences via properly defined Markov
processes and their generalizations.

[Figure 2 panels: GloVe (64.3), Word2vec (68.7), Regression (75.3)]

Figure 2: MNIST digit embedding using word embedding methods (left three)
and metric embedding on the same graph (right). Performance is quantified by
percentage of 5-nearest neighbors sharing the same cluster label.

              Manifold Learning     Word Embedding
Analogy       Isomap    SNE         Regression
Semantic      83.3      21.5        70.7
Syntactic     8.2       1.5         76.9
Total         51.4      13.1        73.4

Table 2: Word embedding generated using human semantic similarity surveys
and manifold learning outperforms word embeddings from a corpus.

              Classification        Sequence
Method        Cosine    L2          Cosine    L2
Regression    84.6      87.6        59.0      58.3
GloVe         80.1      73.1        59.0      48.8
SVD           74.6      65.2        53.0      52.4
Word2vec      84.6      76.4        56.2      54.4

Table 3: Regression with L2 loss performs well on semantic classification
and sequence data.

Supplement for:

Word, graph and manifold embedding 
from Markov processes 


September 18, 2015 


1 Consistency of the global minima of word embedding algorithms 


Lemma S1.1 (Law of large numbers for log co-occurrences). Let $X_t$ be a Markov
chain defined by the transition

\[
P(X_t = x_j \mid X_{t-1} = x_i) =
\frac{\exp(-\|x_i - x_j\|_2^2/\sigma^2)}{\sum_{k=1}^{n} \exp(-\|x_i - x_k\|_2^2/\sigma^2)}
\tag{1}
\]

and $C_{ij}$ be the number of times that $X_t = x_j$ and $X_{t-1} = x_i$ over m
steps of this chain. Then for any $\delta > 0$ and $\varepsilon > 0$ there exist
some m and constants $a_i^m$ and $b_j^m$ such that

\[
P\Big(\sup_{i,j} \Big| -\log(C_{ij}) - \|x_i - x_j\|_2^2/\sigma^2 + a_i^m + b_j^m \Big| > \delta\Big) < \varepsilon.
\]

Proof. By detailed balance we observe that the stationary distribution
$\pi_X(x_i)$ exists and is the normalization constant of the transition:

\[
P(X_t = x_j \mid X_{t-1} = x_i)\,\pi_X(x_i)
= \frac{\exp(-\|x_i - x_j\|_2^2/\sigma^2)}{\sum_{k=1}^{n} \exp(-\|x_i - x_k\|_2^2/\sigma^2)}
\sum_{k=1}^{n} \exp(-\|x_i - x_k\|_2^2/\sigma^2)
= P(X_t = x_i \mid X_{t-1} = x_j)\,\pi_X(x_j).
\]

Define $m_i$ as the number of times that $X_t = x_i$ in an m-word corpus. Applying
the Markov chain law of large numbers, we obtain that for any $\varepsilon_0 > 0$
and $\delta_0 > 0$ there exists some m such that

\[
P\Big(\sup_i \big|\pi_X(x_i) - m_i/m\big| > \delta_0\Big) < \varepsilon_0.
\]

Therefore with probability at least $1 - \varepsilon_0$,
$m_i > m(\pi_X(x_i) - \delta_0)$.

Now given $m_i$, $C_{ij} \sim \mathrm{Binom}(P(X_t = x_j \mid X_{t-1} = x_i), m_i)$;
applying Hoeffding's inequality and union bounding, for any $\delta_1 > 0$ and
$\varepsilon_1 > 0$ there exists some set of $m_i$ such that

\[
P\Big(\sup_{i,j} \big| C_{ij}/m_i - P(X_t = x_j \mid X_{t-1} = x_i) \big| < \delta_1\Big)
> \big(1 - 2\exp(-2\delta_1^2 m_i)\big)^{n^2} = \varepsilon_1.
\]

Since $\|x_i - x_j\|_2 < \infty$, $P(X_t = x_j \mid X_{t-1} = x_i)$ is lower
bounded by some strictly positive constant c and we may apply the continuous
mapping theorem on $\log(\cdot)$, which is uniformly continuous over
$(c, \infty)$, to obtain that for all $\delta_2$ and $\varepsilon_2$ there exists
some set of $m_i$ such that

\[
P\Big(\sup_{i,j} \big| \log(C_{ij}) - \log(m_i) - \log(P(X_t = x_j \mid X_{t-1} = x_i)) \big| < \delta_2\Big)
> \varepsilon_2.
\]

Therefore given any d and s for the theorem statement, set 62 = S and £2 = \/£ and define m' as the smallest 
mi required. Since sup^^- \ \xi — a:j|| < 00 , the Markov chain law of large numbers implies we can always find 
some m such that infimj > m' with probability at least p£ which completes the original statement. □ 
Theorem S1.2 (Consistency of SVD-MDS). Let $C_{ij}$ be defined as above, let $M_{ij} = \log(C_{ij})$, and let the
centering matrix be $V = I - \mathbf{1}\mathbf{1}^T/n$. Define the SVD-based embedding $\hat{X}$ as

\[
\hat{X}\hat{X}^T = \hat{M} = VMV/2 .
\]

Without loss of generality, also assume that the latent vectors $x$ have zero mean. Then for any $\varepsilon > 0$ and
$\delta > 0$, there exists some $m$, scaling constant $\sigma$, and an orthogonal matrix $A$ such that

\[
P\Big( \sum_i \|A\hat{x}_i/\sigma^2 - x_i\|_2^2 > \delta \Big) < \varepsilon .
\]
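The double-centering step in the definition of $\hat{M}$ can be illustrated in isolation. The following is a minimal sketch under stated assumptions: the function name `double_center` and the list-of-lists matrix format are hypothetical, and the eigendecomposition that turns the centered matrix into coordinates is deliberately omitted. It shows that centering removes any additive row and column offsets, such as the constants $a_i^m$ and $b_j^m$, and leaves the inner products of zero-mean points.

```python
def double_center(M):
    """Return V M V / 2 with V = I - 11^T / n, for a square list-of-lists M."""
    n = len(M)
    row_mean = [sum(row) / n for row in M]
    col_mean = [sum(M[i][j] for i in range(n)) / n for j in range(n)]
    grand_mean = sum(row_mean) / n
    # (V M V)_ij = M_ij - row_mean_i - col_mean_j + grand_mean
    return [
        [(M[i][j] - row_mean[i] - col_mean[j] + grand_mean) / 2 for j in range(n)]
        for i in range(n)
    ]
```

Applied to a matrix of negative squared distances plus arbitrary row and column offsets, the result is the Gram matrix of the centered points, which is why the offsets do not need to be estimated before embedding.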

Proof. By Lemma Sl.l we have that 


P ( sup 


- log(Cij) -\\xi- XjWHa^ + of + h 


> (5 1 < £ 


Since mean error cannot exceed entrywise error we can bound the row averages of log(C'ij), where the dot 
product term is zero since x is zero mean. 


P sup 


E,' lOg(C'y) 


Or in other words, — 

P. 
n 


Xi iog(C'i3 


— a; — 


a™ + 


E, 


J 3 


WM 2 


E,-lk, 


Il“'jll2 , „/ '^j^3 

+ 2{Xi, -^— 


> (5 < £ 


Define ML = —\og{C)ij — ——applying the triangle inequality and combining both bounds 


gives 


y . M '. 

Z-^t 11 


Plsupl^^^^- ( bi k ^ 11 ^,II _ 


Efc 


Efclkfclli 


>2(5 < 1 - (1 - £)C 


Note that M' - — M[Jn = 2Mij is the doubly centered matrix as defined above and combining all above 


bounds we have, 


P ( sup 

ij 


A^ij {^Vi , Xj 


>4(5 < 1 - (1 - £)^. 


Given that the dot product matrix has error at most 4(5 the resulting embedding it known to have at most 
\/E error [15]. 

This completes the proof, since we can pick (5 = (5^/4 and £ = 1 — (1 — £)^/^ □ 

Lemma SI.3 (Consistency of SVD). Assume the conditions of Theorem SI.2 and additionally, assume the 
norm of the latent embedding is proportional to the unigram frequency 


EiC, 


■j 


\Xi\\la~= ^ 

Under these conditions. Let X be the embedding derived from the SVD of Mij as 

2XX^ = Mij = \og{Cij) - log ( ^ Cik) - log ( ^ Ckj) + log ( ^ Cij) + r. 

k k ij 

Then there exists a r .such that this embedding is close to the true embedding under the same equivalence 
class as Lemma SI.3 

,2 „ ||2 


“(^ \\ALxila^ ~ XjW'i > < £• 
Proof. By Lemma S1.1, for any $\delta_0 > 0$ and $\varepsilon_0 > 0$ there exists an $m$ such that


P I sup 


n 

log(Cij) - \\Xi - Xj||2/cr^ + log (^^exp(-||xi - XkWl/cT^^)^ - log 


k=l 


> (5o < So 


which implies that for any (5i > 0 and £i > 0 there exists a m such that 


P ( sup 
hi 


- log(Cii) “ {W^i - - log(mc) 


> (5i < £1. 


Now additionally, if J2k Cfc/ fjPjPj = IkilP/o'^ then we can rewrite the above bound as 


P sup 
\ hi 


log(C'y) - log Cik^ - log y Ckj^ + log y Cif - 2{xi, Xj)/a^ - log(mc) 

k k ij 

and therefore, 


> (5i < £i. 


P ( sup 
hi 


Mij - 2{xi,Xj) - log(mc) 


> (5i < £i. 


Given that the dot product matrix has error at most (5i, the resulting embedding it known to have at most 
error [15]. 

This completes the proof, since we can pick t = — log(mc), (5i = 5^ and £i = £. □ 

Theorem SI.4 (Consistency of GloVE). Define the GloVe objective function as 

g(x,c,a,b) = f{Cij){2xiCj + a + 6 - log(Cy ))^ 
hi 


Define Xm,Cm,am,bm as the global minima of the above objective function for a corpus of size m. 

Then the parameters derived from the true embedding in Lemma Sl.l, x' = xja, a' = off — 
bt = is arbitrarily close to the global minima in the sense that for any £ > 0 and (5 > 0 there 

exists some m such that 


V{\g{x',x',a',b') - gixm,Cm,am,bm)\ > S) < e 

Proof. Using Lemma Sl.l with error So and probability £o there exists some m such that uniformly over i 
and j, 

i-\\xi - xjWl/a^ + aT + bT+ log(C,))2 < 

Now recall that f{Cij) < = c therefore 

¥{g{x',x',a',h') > cu^Sq) < £o- 

Now the global minima g{xm,, Cm, hm, &m) must be less than g{x', x', a!, b') and we have 0 < g{xm,Cm,(i,m,bm) < 
g{x',x',a',b'). 

Therefore, 

V{\g{x\x\a',b') - g{xm,Cm,am,bm)\ > cn^5o/2) < eg. 

Picking a m such that (5o = 25l{cnf) and £o = £ concludes the proof. □ 

Theorem SI.5 (Consistency of softmax/word2vec). Define the softmax objective function with bias as 


^i:c,,log 

ij \Efc=iexp(-||xi-Cfc||^ + &fe)^ 


Define Xm, Cm, bm as the global minima of the above objective function for a corpus of size m. We claim that 
for any £ > 0 and <5 > 0 there exists some m such that 

f’{\g{x/a,x/(r,Q) - g{x,c,h)\ > 5) < e 
Proof. By differentiation, any objective of the form


min Cij log 

Aij 


/ exp(-Aij) \ 

VEfcexp(-Aife)7 


has the minima Ay = — log(Cy) + a* with objective function value Cij ^og{Cij/ ^ik)- This gives a global 
function lower bound 

KZkCik 


g{x, c,b)>^ Cij log (i 

ij A. 

Now consider the function value of the true embedding x/cr; 

( / I o\ n \ ( exp(-||xi - \ 

g{x a, X a, Q) = > Cy log - i-j. - - —rT 2 T^ 

^ VEfcexp - Xi-Xfc ^/cr2 7 


= Y1 

b' 


/ exp(log(C’y) + Sjj + aj) \ 

\ Efc exp(log(Cifc) + Sik + ai)J ' 


We can bound the error variables Sij using Lemma Sl.l as sup^^' |(5yj < So with probability sq for sufficiently 
large m with a* = log(mi) - log(X]Ei “ 2 :fc| l^/cr^)). 

Taking the Taylor expansion at Sij = 0, we have 


g(x/o-, xja, 0 ) = ^ Cy log ( ^ 
ij 

Applying Lemma Sl.l we obtain: 


Ci. 


\J2k^ikJ f^J2k^i 


E 


Cii 


-Sii+om\l 


i^g{x/a,x/a,G) - ^Cy log | > 


Combining with the global function lower bound we have that 

f’{^g{x/a,x/(r,Q) - g{x,c,b) > nJo) < eo- 
To obtain the original theorem statement, take m to fulfil So = 5/n and sq = £■ 


□ 


Note that for negative-sampling based word2vec, applying the stationary point analysis of [7] combined
with the analysis in Lemma S1.3 shows that the true embedding is a global minimum.

Theorem SI .6 (Metric regression consistency). Define the negative binomial objective function 
A(a;, c, o, b) = exp(—Haij — Cj H 2/2 -I- Uj + bj) 


g{x,c,a,b,6) = ^ 6 'log( 6 >) - 6log{X{xi,Cj,ai,bj) -|- 0 ) -|- Cy log 1 - 


1,3 


e 

X{xi,Cj,ai, bj) + 9 


+ log 


r(Cy + 9) 

r( 0 )r(C'y + 1 ) 


Then the parameters derived from the true embedding in Lemma Sl.l, x' = xja, o' = dC, 6 ' = is 
arbitrarily close to the global minima g{xm,Cm,am,bm) in the sense that for any e > 0 and d > 0 there exists 
some m such that 

V{\g{x',x',a',b') - g{xm,Cm,dm,bm)\ > S) <€ 


Proof. The proof proceeds identically to that of Theorem SI.5. First obtain the global minima at A(xi, Cj, a^, bj) = 

^ij ) 


g{x. 


mi ^mi '-^mi 


) > E 


ij 
where


kij = Cij (log(Cy) - \og{Cij +9) + 9{\og{e) - \og{Cij + 9)) + log(r(Cy + 9)) - log(r(e)) - log(Cy + 1)). 

As with Theorem SI.5, rewriting \{xi,Cj,ai,bj) = Cy exp((5ij) allows us to take the taylor expansion for 
exp((5y ) small, giving 

llh(x, a, b,9)=Y^ kij - m 

ij ^ 

Applying Lemma Sl.l we obtain: 

v(^g{x',x',a',b')-'^kij > nSoJ < eo 

-ij 

which when combined with the global function bound yields that the global minima is consistent. □ 

2 Symmetry and windowing co-occurrences

Existing word embedding algorithms utilize weighted, windowed, symmetrized word counts. Let $C^{t'}_{ij}$ define
the $t'$-step co-occurrence, which counts the number of times $X_{t+t'} = x_j$ and $X_t = x_i$.

Then for some weight function $w(t')$ such that $\sum_{t'=1}^{\infty} w(t') = 1$, we define

\[
\hat{C}_{ij} = \sum_{t'=1}^{\infty} w(t') \big( C^{t'}_{ij} + C^{t'}_{ji} \big).
\]

This is distinct from our stochastic process approach in two ways: first, there is symmetrization by counting
both forward and backward transitions of the Markov chain; second, all words within a window of the center
word $X_t$ are used to form the co-occurrences.

Symmetry: We begin by considering asymmetry of the random walk. If the Markov chain is reversible, as
in the cases of the Gaussian random walk, undirected graphs, and the topic model, we can apply detailed
balance to show that the joint distributions are symmetric:

\[
P(X_{t+1} = x_j \mid X_t = x_i)\,\pi_X(x_i) = P(X_{t+1} = x_i \mid X_t = x_j)\,\pi_X(x_j).
\]

Therefore the empirical sum converges to

\[
C^{t'}_{ij} + C^{t'}_{ji} \to P(X_{t+t'} = x_j, X_t = x_i) + P(X_{t+t'} = x_i, X_t = x_j) = 2P(X_{t+t'} = x_j, X_t = x_i).
\]
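The detailed balance identity can be verified directly for the Gaussian random walk. This is a small sketch with a hypothetical helper name (`stationary_flows`) and an assumed list-of-tuples input format; it computes the one-step joint probabilities $\pi_X(x_i)\,P(x_j \mid x_i)$, using the fact that the stationary distribution is proportional to the row normalizers of the kernel.

```python
import math


def stationary_flows(points, sigma2):
    """Return F with F[i][j] = pi(x_i) * P(x_j | x_i) for the Gaussian-kernel walk."""
    n = len(points)
    K = [
        [math.exp(-sum((a - b) ** 2 for a, b in zip(points[i], points[j])) / sigma2)
         for j in range(n)]
        for i in range(n)
    ]
    Z = [sum(row) for row in K]   # row normalizers of the kernel
    total = sum(Z)
    pi = [z / total for z in Z]   # stationary distribution, proportional to Z
    return [[pi[i] * K[i][j] / Z[i] for j in range(n)] for i in range(n)]
```

Because the kernel is symmetric, the returned matrix is symmetric as well, so forward and backward counts estimate the same quantity in the reversible case.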

In the cases where the random walk is non-reversible, such as a $k$-nearest neighbor graph, the
two terms are not exactly equal. However, note that if the non-symmetrized transition matrices $C_{ij}$ fulfill
Varadhan's formula both ways:

\[
-t\log(C_{ij}) - a_i^t \to \|x_i - x_j\|_2^2 + b_j^t \quad \text{and} \quad -t\log(C_{ji}) - a_j^t \to \|x_j - x_i\|_2^2 + b_i^t ,
\]

the sum will fulfill

\[
(C_{ij} + C_{ji}) = \exp(-\|x_i - x_j\|_2^2/t + o(1/t)) \big( \exp(a_i/t + b_j/t) + \exp(b_i/t + a_j/t) \big)
\]

and

\[
-t\log(C_{ij} + C_{ji}) = \|x_i - x_j\|_2^2 + \log\big( \exp(a_i/t + b_j/t) + \exp(b_i/t + a_j/t) \big)\, t + o(1).
\]

More specifically, for the manifold case, $a_i = \log(\pi_{X_n}) \to \log(np(x)/\sigma(x_i)^2)$ and $b_j = -\log(np(x))$, and
so the above term reduces to

\[
-t\log(C_{ij} + C_{ji}) = \|x_i - x_j\|_2^2 + \log\big( \sigma^{-2}(x_i) + \sigma^{-2}(x_j) \big)\, t + o(1).
\]

Since $\sigma$ is independent of $t$, as $t \to 0$ we are once again left with Varadhan's formula in the symmetrized
case.
In practice, this does not seem to affect the manifold embedding approaches much; in the results section
we attempt embedding the MNIST digits dataset using the $k$-nearest neighbor simple random walk, which is
non-reversible.

Windowing: Now we consider the effect of windowing. We focus on the manifold case for analytic simplicity, 
but the same limits apply to the other two examples of Gaussian random walks and topic models. 

Let $q_t(x, x') = P(Y_t = x \mid Y_0 = x')$, where $Y_t$ fulfills Varadhan's formula such that there exists a metric
function $\rho$,

\[
\lim_{t \to 0} -t\log(q_t(x, x')) \to \rho(x, x')^2 .
\]

Under these conditions, let $\hat{q}_t(x, x') = \int_0^t q_{t'}(x, x')/t \, dt'$ define the windowed marginal distribution. We
show this follows a windowed Varadhan's formula:

\[
\lim_{t \to 0} -t\log(\hat{q}_t(x, x')) \to \rho(x, x')^2 .
\]

This can be done via a direct argument. Varadhan’s formula implies that. 


qtix,x') = exp ) +0 


Thus we can hnd some bounding constants 0 < c = o(l) such that 


^expl- 


p{x, x' 


j lexpi- 


+ £ ) di' 


Performing the bounding integral for general c G K, 

'■*1 / p{x,x’f c\ , _l ( ( p{x,x’f-2c 


-exp,- 


t' t' 


- dt = - exp - 


= - 1 exp 


2t 

C p{x,x'Y 

t t 




2t 


2t 


2c — p{x, x')"^ 


+ 


Therefore we have that for any c. 

By the two-sided bound and c = o(l). 


\imt%{x,x') —7- p{x',x)‘^. 


as desired. 


3 Varadhan’s formula on graphs 

We first prove the convergence of marginal densities under the assumption of equicontinuity. 

Lemma S3.1 (Convergence of marginal densities). Let xq be some point in our domain Xn and define the 
marginal densities 

qt{x) = f‘{Yt = x\Yo = xo) 

qtSx) = nXt = 2 : 1^0 = xo) 


If tuPn = t = 0(1), then under condition (*) and the results of Theorem 3 such that Xf —> Yfi weakly, we 
have 


lim nqt,,{x) 

n—yoo 


%{x) 

p{x) 
Proof. The a.s. weak convergence of processes of Theorem 3 implies by [2, Theorem 4.9.12] that the empirical
marginal distribution 

n 

dj-in - ^ ^ Qtn 

i=l 

converges weakly to its continuous equivalent dfi = qi^{x)dx for ly. For any x ^ X and (5 > 0, weak 
convergence against the test function 1^(2;,5) yields 


yeXn,\y-x \<6 


f\y-x\<5 


%{y)dy. 


By uniform equicontinuity of nqt{x), for any e > 0 there is small enough <5 > 0 so that for all n we have 


which implies that 


(ItSy)-\^n^B{x,5)\qt{x) 

V^X'n.,\y-x\<& 


< n ^ \Xn n B{x, (5)|e, 


lim qt {x)p{x)n = lim lim ^<5 '^nqt {x) / p{y)dy 

n^oo S^On^oo J\v-x\<5 

= lim lim Vjf^S~‘^\Xn n B{x,6)\qtjx) = lim [ %^{y)dy = %^{x). 

S^On^oo S^O J\y_x\<S 


We conclude the desired 


lim nqt{x) 


%{x) 

p{x) 


Given this, we can now prove Varadhan’s formula specialized to the manifold graph case: 


□ 


Corollary S3.2 (Heat kernel estimates on graphs). For any 6 > 0,7 > 0,no > 0 there exists some t, n > no, 
and sequence bf such that the following holds for the simple random walk Xf: 


P sup \tlog(V{Xf 2 = Xj I Xq = Xi)) -tb] - p^(^ 2 :){xi,Xjf\ > (5 < 7 

\^Xi,Xj^XnQ J 

Where pa{x) the geodesic defined by a{x): 

Pa(x){xi,Xj) = min [ a{f{t))dt 

f£C^:f(Q)=XiJ(l)=Xj Jo 

Proof. The proof is in two parts. First, by Varadhan’s formula (Theorem 4, [13, Eq. 1.7]) for any (5i > 0 
there exists some t such that: 

sup 1 -t\og(F{Yp=y'\Yo = y)) - Pa{x)iy',yf\ < <^1 

y,V'€D 

Now uniform equicontinuity of marginals implies uniform convergence of marginals (Lemma S3.1) and there¬ 
fore for any <52 > 0 and 70, there exists a n such that, 

P( sup \V{Y-^= Xj\Yo = Xi) - np{xj)V{X^- 2 ,^= Xj\XQ = Xi)\ > 62 ) < Jo 

Xj^Xi^XriQ ” 

By the lower bound on p and compactness of the domain D, P(Ty|Fo) is lower bounded by some strictly 
positive constant c and we can apply uniform continuity of log(a;) over (c, 00) to get that for some (53 and 7, 

P( sup llog(P(V'= XjJVo = Xi)) - \og{np{xj)) - log(P(V” 2:^= ajjJV? = Xi))\ > 5o) < 7- (2) 

Xj,Xi^XnQ 
Finally we have the bound,


P( sup 

Xi ,Xj^Xn 


-t\og{f‘{X'^-2^=Xj\XQ = Xi)) -t\og{np{Xj)) - Pa{x){Xi,Xjf 


> (5i + 162 ) < 7 


To combine the bounds, given some 6 and 7, set 6 ” = log{np{xj)), pick t such that (5i < 6/2, then pick n 
such that the bound in Eq. 2 holds with probability 7 and error <53 < 6/{2t). □ 

4 Heat kernel for topic models 

Theorem S4.1 (Heat kernel estimates for the topic model). Let h have smooth subgaussian tails as defined 
by the following conditions; there exists some ip{x) such that snp^ip^x) < 00 and \ \x\\ 2 ‘^^'^tp{x)dx < 00 
such that: 

1. (tail bound) For all \v\ < 6 , \D‘/.h{x)\ < '/’{x) 

2. (convolved tail bound) For all \v\ < 6, for the k-fold convolved kernel h^^\ \D'/.h^^'>{x)\ < k'^f:{K~'^x) 
for 7 > 0. 

Further, if \og{w{x)) has bounded gradients of order up to 6 , then we have the following: 

Let ay = a^a^, where ao = f ||a:|| 2 h(||a:|| 2 )da; then the random walk Yt defined by Equation f admits a 
heat kernel approximation of the marginal distribution at time t in terms of constants 7 Ti{x,y) and vi(x,y), 


sup 

x,y 


1 f-\\vt-y^\\l 


( = Vt\Yo = Vo)- ) (1 + 


<t ^^‘^Tri{yt,Vo)+o{t 


Proof. First, the Stroock-Varadhan theorem [17] implies that after t = ta steps there exists a limiting 
process limo-^o ^ described by the SDE 

dYp = V \og{w{Yp))aldt + aodWp. 

In our case, we can determine the rate of convergence of the marginal distributions of Y) to Y) by an 
Edgeworth approximation due to our smooth tail constraints on h[5. Theorem 4.1]. 

sup |P(Y) = yt\Yo = yo) - V(Yt = yt\Yo = yo) - t~^/^TTi{yt,yo) - t~^Tr 2 iyt,yo)\ < 0{t~^~^) 

Vt,Vo ' ' 

The details of Tri{yt,yo) and 'K 2 {yt,yo) are given in [5, Theorem 4.1], we note that if the drift is constant 
V log(ro(a;)) = c, the marginal of Y) is exactly gaussian and 7 ri(yt, yf) and 'K 2 {yt, yo) are exactly the terms in 
an Edgeworth approximation when applying the central limit theorem to h{x). 

The above approximation is tight as t —> 00 ; however, the marginal distribution of Yi^^ is only Gaussian 
as ta"^ —> 0. We show that this convergence is fast in such that if cr^ is sufficiently small the heat kernel 
is still an useful approximation. 

Let qt{y, x) = P(Y) = yJYo = x) then by the Eokker-Planck equation, this fulfils the following relationship: 


Q _^ Q _^ Q 

g(Vt{yt, yo) = d/ 7 f)X^oQt{yt, yo) + ^og{w{yt))al%{yt, yo) 

i,j ^ * 


dyi 


We use short time asymptotics of second-order elliptic differential equations to obtain the higher order 
expansion [4]: 


0 , X, y) = 


1 


(47rcrgt)'^/2 


exp - 


\Xi - yi\\2 
4(7^? 


J2xj{x,y)F 


vi=o 


Recall that t = a^t. Substituting into the above gives that 


sup 

x,y 


t = yt\Yo = yo) - 


(47rt(T2)d/2 


exp 


yt - Vo \\2 
Aalt 


'^Vjiyt,yo)t^ 


\j=o 


<t ^^^TTi{yt,yo) + o{t 



Finally it suffices to show that vo{x,y) = 1 which follows from the fact that our data lie in Euclidean 
space[16]. In more general manifolds, there will be some curvature associated distortion to the density. □ 

This proof gives the intuition behind Varadhan’s formula. While there are confounders such as the 
kernel h, drift w, and curvature Hess(logw); these issues all dissapear when t —> oo (large window size) and 
a~^ << t (topics remain local). 

Combining this with the emission probability of X gives the appropriate heat kernel estimate directly on 
the observed random walk over words. Applying Theorem S4.1, we obtain the following approximation: 

nx. = = X,) = Jinx, = x,|r, = „mY, = 9,|r„ = 

= —[ [V{Xt = Xj\Yt = yt)ViXi = Xi,Yo = yo)V{Yt = yt\Yo = yo)dyodyt 

'^X \^i) J J 


Where nxi^i) is the unigram frequency. Dealing with the inner integral first, 


/p(x. = x..y„ = »)p(y, = „,|y„ = »),i» 

! J w{yo) ex 


oc ai 


\Xi - 2/0II2 


(47rcr2i)d/2 


exp 


yt - yo \\2 

4a2t 


(1 + o{t))dyo 


= aiw{xi){ 2 TTa)'^^‘^ 


(47rcr2t)'i/2 


exp - 


yt - Xi\\^ 
4<T2t 


^ (1 + 0(aj,t) + 0(?f2) + 0(t-i/2)) 


Where the last approximation is a Laplace approximation for small a taken at Xi [3]. Now applying the 
integral over yt 


V{Xt = Xj\Xo = Xi) (X—^w{xi)exp 


^i \\2 


Tt{Xi) 


2(0- + tal) 


(l + 0(a2t) + 0(u2))+0(t-i/2) 


This has the appropriate form of a heat kernel estimate with the ai = log(ai) + log(w(a:i)) — log(7r(a:i)) 
with two sources of error: too few steps resulting in non-gaussian transitions and too many steps 

introducing distortions 0 {ayt), 


4.1 Relationship to the topic model of Arora et al 

The preprint [1] proposes the following latent topic model. Define $c_t$ as a discrete-time,
continuous-space latent topic process with the following restrictions:

1. (Stationary distribution near zero) The stationary distribution $\mathcal{C}$ is a product distribution, with
$E_{c \sim \mathcal{C}}[|c_i|^2] = 1/d$ and almost surely $|c_i| < 2/\sqrt{d}$.

2. (Increments of $c$ have light tails and converge to zero for large corpora)

\[
E_{p(c_{t+1} \mid c_t)}\big[ \exp(4\kappa\,|c_{t+1} - c_t|_1 \log(m)) \big] < 1 + \varepsilon_2 .
\]

The observed sentence is then generated by

\[
P(w \mid c) = \frac{\exp(\langle v_w, c \rangle)}{\sum_{w'} \exp(\langle v_{w'}, c \rangle)} .
\]

Under these conditions, they show that for words $w$, $w'$ and for sufficiently large corpora,

\[
\log(p(w, w')) = \frac{1}{2d}\|v_w + v_{w'}\|_2^2 - 2\log Z \pm o(1) .
\]
This model is qualitatively quite similar to our topic model. Condition (1) on the stationary distribution
is analogous to our limit if —> 0, which ensures the noise term is sharp with respect to our stationary 
distribution w{x). Condition (2) is the increment size constraint, —> 0. 

The conceptual distinction between these two methods is that our topic model arises as a natural extension
of our short time asymptotic manifold analysis. The heat kernel argument gives direct intuition and
justification for the Gaussian decay of the resulting marginal distribution.

Examining the models in detail, the two conditions of [1] on the latent topic model are stronger than 
ours in the sense that we do not require quantitative bounds on the stationary distribution or the increment 
size; they may go to zero at any rate with respect to the corpus size. We gain these weaker conditions by 
assuming that the vocabulary size (n in our notation) goes to infinity and taking many steps t —> oo. 

This trade-off between additional assumptions, either as direct constraints or as additional limits, is unavoidable.
Recall that

\[
P(X_t = x_j \mid X_0 = x_i) = \frac{1}{\pi_X(x_i)} \int\!\!\int w(y_0)\, P(X_t = x_j \mid Y_t = y_t)\, P(X_0 = x_i \mid Y_0 = y_0)\, P(Y_t = y_t \mid Y_0 = y_0)\, dy_0\, dy_t .
\]

In order to obtain exponential decay on the LHS assuming only exponential decay in the word emissions
$P(X_0 = x_i \mid Y_0 = y_0)$, we must either invoke a Gaussian limit for $P(Y_t = y_t \mid Y_0 = y_0)$ or converge it to a point
mass relative to $P(X_0 = x_i \mid Y_0 = y_0)$. Our use of the large vocabulary limit and the heat-kernel approximation
allows us to take the former limit, rather than use assumptions to force $P(Y_t = y_t \mid Y_0 = y_0)$ to a point mass.
5 Empirical evaluation details

5.1 Implementation details 

We used off-the-shelf implementations of word2vec* and GloVe†. In the paper, these two methods
are run with their standard settings, with two exceptions: GloVe's corpus weighting is disabled, as this
generally produced superior results, and GloVe's stepsizes are reduced, as the default stepsizes resulted in
NaN-valued embeddings.

For all the models we used 300-dimensional vectors, with window size 5. For word2vec we used the 
skip-gram version with 5 negative samples, 10 iterations, α = 0.025 and frequent word sub-sampling with a
parameter of 10“^. For GloVe we used x_max = 10, η = 0.01 and 10 iterations.

The two other methods, (randomized) SVD and regression embedding, are both implemented on top of
the GloVe codebase. For SVD we factorize the PPMI with no shift (τ = 0 in our notation from the main
text) using 50,000 vectors in the randomized projection approximation. For regression, we use θ = 50 and η
is line-searched starting at η = 10.

5.1.1 Regression embedding 

For regression embedding, we do standard stochastic gradient descent with two differences. First, any word
co-occurrence pairs $C_{ij}$ with counts fewer than ten are skipped with probability proportional to $1 - C_{ij}/10$;
this is done to achieve dramatic speedups in training time with no detectable loss in accuracy. Second, we
avoid the problem of stepsize tuning by using an initial line search step combined with a linear stepsize
decay by epoch. Otherwise, initialization and other optimizer choices are kept identical to GloVe.
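The two modifications can be sketched as small helper functions. The code below is an assumption-laden illustration, not the implementation used in the experiments: the names `skip_probability` and `linear_stepsize` are hypothetical, the proportionality constant for skipping is taken to be 1, and the decay is assumed to reach zero after the final epoch, which the text does not specify.

```python
def skip_probability(count, threshold=10):
    """Probability of skipping a co-occurrence pair during SGD.

    Assumes the proportionality constant is 1, so a pair with count c below
    the threshold is skipped with probability 1 - c / threshold.
    """
    return max(0.0, 1.0 - count / threshold)


def linear_stepsize(eta0, epoch, n_epochs):
    """Stepsize for a given epoch (0-indexed) under an assumed linear decay to zero."""
    return eta0 * (1.0 - epoch / n_epochs)
```

Under these assumptions a pair seen five times is kept about half the time, while any pair seen ten or more times is always used.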

5.1.2 Randomized SVD 

Due to the memory and runtime requirements of running a full SVD decomposition, we performed approximate
SVDs using randomized projections.

For the SVD algorithm of [7], we use the GloVe co-occurrence counter combined with a parallel randomized
projection based SVD factorizer based upon the redsvd library‡. We implement the reasonable best practices of
[8], using the square root factorization and no negative shifts. For the number of approximation vectors,
we tried various sizes and found that vector counts past 50,000 offered little improvement.

5.2 Word embedding corpora 

We used three corpora to train the word embeddings: the full Wikipedia dump of 03/2015 (about 2.4B
tokens), a larger corpus similar to that used by GloVe [14]: Wikipedia2015 + Gigaword5 (5.8B tokens in
total), and the one used by word2vec [10], which consists of a mixture of several corpora from different sources
(6.4B tokens in total).

We preprocessed all the corpora by removing punctuation and numbers and lower-casing all the text. We then
ran two passes of word2vec's tokenizer word2phrase. As a final step, we removed function words from
the vocabulary and kept only the 100K most common words for all our experiments.
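A rough sketch of the normalization and vocabulary truncation steps is given below. It is not the pipeline used for the experiments: the function names `normalize` and `top_k_vocab` are hypothetical, the regular expression assumes ASCII English text, and the word2phrase passes and function-word removal are left out.

```python
import re
from collections import Counter


def normalize(text):
    """Lower-case the text and drop punctuation and digits (ASCII letters only)."""
    cleaned = re.sub(r"[^a-z\s]", " ", text.lower())
    return cleaned.split()


def top_k_vocab(tokens, k):
    """Return the set of the k most frequent tokens."""
    return {word for word, _ in Counter(tokens).most_common(k)}
```

In this sketch, tokens outside the retained vocabulary would simply be dropped before counting co-occurrences; whether the original pipeline dropped or replaced them is not stated.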

5.3 Datasets for semantic tasks 

Our first set of experiments is on two standard open-vocabulary analogy tasks: Google [9] and MSR [11]. 
Google consists of 19,544 semantic and syntactic analogy questions, while MSR’s 8,000 questions are all 
syntactic. As an additional analogy task, we use the SAT analogy questions (version 3) of Turney [18]. 
The dataset contains 374 questions from actual SAT exams, guidebooks, from the ETS web site and other 
sources. Each question consists of 5 exemplar pairs of words word1:word2, where all the pairs hold the same
relation. The task is to pick from among another five pairs of words the one that best represents the relation
represented by the exemplars. To the best of our knowledge, this is the first time word embeddings are used
to solve this task.

*http://code.google.com/p/word2vec
†http://nlp.stanford.edu/projects/glove
‡https://github.com/ntessore/redsvd-h

Given the current lack of freely available datasets with category and sequence questions, as described 
in Section 2, we decided to create them. We used nltk's§ interface to WordNet [12] in combination with
word-word PMI values computed on the Wiki corpus to create the sequences and classes. 

As a first step, we collected a set of root words from other semantic tasks to initialize the methods. For 
the classification data, we created the in-category words by selecting words from various WordNet relations 
associated to the root words, after which we pruned down to four words based on PMI-similarity to the root 
word and the other words in the class. The additional options for the multiple choice question were created 
searching over words related to the root by a different relation type, and selecting those most similar to the 
root. 

For the sequence data, we obtained from WordNet trees of words given by various relation types, and 
then pruned based on similarity to the root word. For the multiple-choice version of the data, we selected 
additional (incorrect) options by searching over other words related to the root word, and pruning, as for 
sequences, based on PMI similarity. 

Finally, we manually pruned all three sets of questions, keeping only the most coherent questions, in 
order to increase the quality of the datasets. After pruning, the category dataset was left with 215 questions 
and the sequence dataset with 51 questions in its open-vocabulary version and 169 in its multiple choice 
version. 

The two datasets will be made available for others to experiment with. We hope that they help broaden 
the type of tasks used to evaluate semantic content of word embeddings. 

5.4 Solving classification and series completion tasks 

In each task we obtain an ideal point via the following vector operations. 

• Analogies: Given A:B::C, form the ideal point by $B - A + C$, following the parallelogram rule.

• Analogies (SAT): Given A:B and candidates $C_1 : D_1, \ldots, C_n : D_n$, form the ideal point by $B - A$ and
represent the options as $D_i - C_i$.

• Categories: Given a set $w_1, \ldots, w_n$ defining a category, we define the ideal to be $I = \frac{1}{n}\sum_{i=1}^{n} w_i$.

• Sequence: Given sequence $w_1 : \cdots : w_n$, we compute the ideal as $I = w_n + \frac{1}{n}(w_n - w_1)$.
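The vector operations in the list above can be written out as follows. This is a minimal sketch using plain Python lists; the function names are hypothetical, and the $1/n$ factor in the sequence rule is an assumption about the intended formula (the average step from the first to the last word).

```python
def analogy_ideal(a, b, c):
    """Parallelogram rule: ideal point B - A + C."""
    return [bi - ai + ci for ai, bi, ci in zip(a, b, c)]


def category_ideal(words):
    """Mean of the vectors defining the category."""
    n = len(words)
    return [sum(w[d] for w in words) / n for d in range(len(words[0]))]


def sequence_ideal(seq):
    """Last vector plus (1/n) times the displacement from the first vector (assumed scaling)."""
    n = len(seq)
    first, last = seq[0], seq[-1]
    return [l + (l - f) / n for f, l in zip(first, last)]
```

Each function returns the ideal point only; choosing an answer then amounts to finding the option closest to that point under the chosen similarity measure.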

5.5 Similarity metrics for verbal reasoning task 

Given the ideal point $I$ of a task and options (possibly the entire vocabulary), we pick the answer by proximity
to the ideal point $I$, measured in three possible ways.

• Cosine: We first unit-normalize each vector as $w_i/\|w_i\|_2$ and use cosine similarity to choose which
vector is closest to the ideal.

• L2: We do not apply any normalization, and pick the closest vector by L2 distance.

• Diff-cosine (SAT only): For the SAT, the differences of the vectors are normalized, and similarity
is measured by cosine distance.

In our experiments we found cosine and L2 to give reasonable performance under all tasks. The
pre-normalization of cosine vectors is consistent with what was done in [10, 6]. For the L2 distance we applied
no normalization.
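The selection rule for the cosine and L2 measures can be sketched as below. The helper names (`cosine_similarity`, `l2_distance`, `pick_answer`) and the convention of returning the index of the chosen option are assumptions made for illustration; the diff-cosine variant for the SAT is not shown.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v (0.0 if either vector is zero)."""
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (nu * nv)


def l2_distance(u, v):
    """Euclidean distance between unnormalized vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def pick_answer(ideal, options, metric="cosine"):
    """Index of the option closest to the ideal point under the given metric."""
    if metric == "cosine":
        return max(range(len(options)), key=lambda i: cosine_similarity(ideal, options[i]))
    if metric == "l2":
        return min(range(len(options)), key=lambda i: l2_distance(ideal, options[i]))
    raise ValueError("metric must be 'cosine' or 'l2'")
```

The two measures can disagree when option vectors have very different norms: a long vector pointing in the same direction as the ideal wins under cosine but may lose under L2.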


§ http://www.nltk.org/ 
Figure 1: MNIST digit embedding using word embedding method and metric embedding on the same graph.
7.1 Top-30k vocabulary restriction


          Google Analogies               SAT    MSR Analogies
          Semantic  Syntactic  Total
Covered   5022      8195       13217     217    4358
Total     8869      10675      19544     374    8000

Table 1: glove corpus question coverage



          Google Analogies               SAT    MSR Analogies
          Semantic  Syntactic  Total
Covered   4746      7679       12425     199    4340
Total     8869      10675      19544     374    8000

Table 2: wiki corpus question coverage



          Google Analogies               SAT    MSR Analogies
          Semantic  Syntactic  Total
Covered   3965      8447       12412     257    4554
Total     8869      10675      19544     374    8000

Table 3: w2v corpus question coverage


Method      Google Analogies (cosine)    Google Analogies (L2)        SAT                        MSR Analogies
            Semantic Syntactic Total     Semantic Syntactic Total     L2    diff-cosine cosine   cosine L2
regression  78.4     70.5      73.5      74.1     70.0      71.2      38.7  41.7        33.7     67.2   64.0
GloVe       70.2     70.9      70.6      59.2     67.7      64.5      37.2  40.7        35.7     61.2   53.5
SVD         55.8     46.4      50.0      49.1     41.2      44.3      32.7  32.2        28.1     33.5   30.3
word2vec    68.0     73.8      71.6      66.6     71.2      69.4      41.9  42.9        41.4     65.0   63.4

Table 4: wiki corpus analogy accuracy


Method      Classification   Sequence         Sequence (open vocab)   Sequence (open vocab, top5)
            Cosine  L2       Cosine  L2       Cosine  L2              Cosine  L2
regression  86.1    85.6     58.0    55.6     7.8     5.9             72.5    60.8
GloVe       80.9    76.7     59.2    51.5     2.0     2.0             51.0    37.3
SVD         74.9    64.7     46.2    46.2     2.0     2.0             21.6    25.5
word2vec    85.1    71.6     57.4    59.2     2.0     5.9             49.0    51.0

Table 5: wiki corpus for classification and sequence


Method      Google Analogies (cosine)    Google Analogies (L2)        SAT                        MSR Analogies
            Semantic Syntactic Total     Semantic Syntactic Total     L2    diff-cosine cosine   cosine L2
regression  78.4     70.8      73.7      75.5     70.9      72.6      39.2  40.6        37.8     65.6   63.9
GloVe       72.6     71.2      71.7      65.6     66.6      67.2      36.9  42.8        33.6     62.0   55.6
SVD         57.4     50.8      53.4      53.7     48.2      50.3      27.1  32.2        25.8     32.0   30.6
word2vec    73.4     73.3      73.3      71.4     70.9      71.1      42.0  49.2        42.0     67.9   66.5

Table 6: glove corpus analogy accuracy





Method      Classification   Sequence         Sequence (open vocab)   Sequence (open vocab, top5)
            Cosine  L2       Cosine  L2       Cosine  L2              Cosine  L2
regression  84.6    87.6     58.9    58.3     0.0     0.0             23.5    21.6
GloVe       80.1    73.1     58.9    48.8     0.0     0.0             27.5    23.5
SVD         74.6    65.2     53.0    52.4     0.0     2.0             19.6    15.7
word2vec    84.6    76.4     56.2    54.4     0.0     3.9             53.0    58.8

Table 7: glove corpus for classification and sequence


Method      Google Analogies (cosine)    Google Analogies (L2)        SAT                        MSR Analogies
            Semantic Syntactic Total     Semantic Syntactic Total     L2    diff-cosine cosine   cosine L2
regression  80.1     73.0      75.2      77.3     73.1      74.4      38.1  43.0        36.9     69.4   68.4
GloVe       70.4     73.0      72.2      61.9     70.0      67.2      36.9  43.9        34.0     66.4   61.6
SVD         55.2     43.6      54.1      52.8     50.6      51.3      27.9  37.3        29.1     35.6   35.4
word2vec    66.8     73.4      71.3      67.2     72.2      70.6      39.0  46.4        42.3     75.3   75.6

Table 8: w2v corpus analogy accuracy

Method      Classification   Sequence         Sequence (open vocab)   Sequence (open vocab, top5)
            Cosine  L2       Cosine  L2       Cosine  L2              Cosine  L2
regression  81.4    85.5     57.1    55.4     0.0     0.0             25.5    21.6
GloVe       78.2    70.0     57.7    50.6     2.0     0.0             31.4    31.4
SVD         74.1    61.1     47.0    48.2     0.0     0.0             35.3    21.6
word2vec    87.0    75.0     52.7    50.9     3.9     5.9             49.0    45.1

Table 9: w2v corpus for classification and sequence


7.2 Top-100k vocabulary

Google Analogies SAT MSR Analogies 

Semantic Syntactic Total 

Covered 7829 10411 18240 217 5612 

Total 8869 10675 19544 374 8000 

Table 10: glove corpus question coverage 


Google Analogies SAT MSR Analogies 

Semantic Syntactic Total 

Covered 7667 10231 17898 199 5186 

Total 8869 10675 19544 374 8000 

Table 11: wiki corpus question coverage 


Google Analogies SAT MSR Analogies 

Semantic Syntactic Total 

Covered 7213 10405 17618 244 5462 

Total 8869 10675 19544 374 8000 


Table 12: w2v corpus question coverage 




Method      Google Analogies (cosine)    Google Analogies (L2)        SAT                        MSR Analogies
            Semantic Syntactic Total     Semantic Syntactic Total     L2    diff-cosine cosine   cosine L2
regression  76.9     64.6      69.9      64.9     62.5      63.5      38.7  41.7        33.7     62.6   57.4
GloVe       69.0     66.0      67.3      53.5     62.1      58.4      37.2  40.7        35.7     58.6   50.2
SVD         53.8     40.2      46.1      40.2     34.2      36.8      32.7  32.1        28.1     31.3   26.7
word2vec    67.9     70.4      69.3      67.4     67.2      67.3      41.7  43.2        41.2     62.4   61.5

Table 13: wiki corpus analogy accuracy


Method      Classification   Sequence         Sequence (open vocab, top 5)   Sequence (open vocab)
            Cosine  L2       Cosine  L2       Cosine  L2                     Cosine  L2
regression  86.0    85.6     58.0    55.6     62.7    37.3                   11.8    15.7
GloVe       80.9    76.7     59.2    51.5     51.0    37.3                   3.9     3.9
SVD         74.9    64.7     46.2    46.2     21.6    25.5                   3.9     3.9
word2vec    85.1    71.6     45.1    43.1     43.1    45.1                   3.9     11.8

Table 14: wiki corpus for classification and sequence



Method      Google Analogies (cosine)    Google Analogies (L2)        SAT                        MSR Analogies
            Semantic Syntactic Total     Semantic Syntactic Total     L2    diff-cosine cosine   cosine L2
regression  75.0     66.4      70.1      70.0     66.1      67.7      39.2  40.6        37.8     62.2   58.9
GloVe       70.7     67.5      68.8      62.5     62.4      62.5      36.9  42.9        33.6     61.0   53.0
SVD         57.0     44.2      50.3      47.9     42.0      44.5      27.2  32.3        25.8     30.6   27.4
word2vec    71.7     71.5      71.5      70.0     68.7      69.5      42.1  48.2        41.7     67.0   66.8

Table 15: glove corpus analogy accuracy


Method      Classification   Sequence         Sequence (open vocab top 5)   Sequence (top 1)
            Cosine  L2       Cosine  L2       Cosine  L2                    Cosine  L2
regression  84.6    87.6     58.9    58.3     23.5    17.6                  0.0     0.0
GloVe       80.1    73.1     58.3    48.8     27.5    23.5                  0.0     0.0
SVD         74.6    65.1     55.6    54.4     19.6    11.8                  0.0     3.9
word2vec    84.6    76.4     55.6    54.4     49.0    54.9                  0.0     7.8

Table 16: glove corpus for classification and sequence


Method      Google Analogies (cosine)    Google Analogies (L2)        SAT                        MSR Analogies
            Semantic Syntactic Total     Semantic Syntactic Total     L2    diff-cosine cosine   cosine L2
regression  78.2     68.9      72.7      72.0     68.6      70.0      38.1  43.0        36.9     66.1   63.1
GloVe       70.6     69.8      70.1      61.2     65.7      63.9      36.9  53.9        34.0     65.3   59.0
SVD         55.9     47.8      51.1      45.4     44.7      45.0      27.9  37.2        29.1     33.6   31.1
word2vec    67.1     71.6      69.8      68.0     70.4      69.4      39.2  47.1        42.8     73.8   74.6

Table 17: w2v corpus analogy accuracy


Method      Classification   Sequence         Sequence (top 5)   Sequence (top 1)
            Cosine  L2       Cosine  L2       Cosine  L2         Cosine  L2
regression  81.3    85.5     57.1    55.4     24.5    21.6       0.0     0.0
GloVe       78.2    70.0     58.3    50.6     31.4    31.4       3.9     0.0
SVD         74.1    61.1     45.8    48.2     31.4    21.6       0.0     0.0
word2vec    87.0    75.0     53.3    50.9     43.1    35.3       7.84    11.8

Table 18: w2v corpus for classification and sequence